View AT40K-FFT_583230.PDF datasheet online --- IC-ON-LINE

Datasheet File OCR Text:

Features
* * * * * * * * * *
Decimation in frequency radix-2 FFT algorithm. 256-point transform. 12-bit fixed point arithmetic. Fixed scaling to avoid numeric overflow. Requires no external memory, i.e. uses on chip RAM and ROM. External access to on-chip RAM for data IO. Clock speed of 21 MHz. 98 ms processing time. 70% utilization of AT40K30 logic resources. 48% utilization of AT40K30 RAM resources.
Description
The Fast Fourier Transform (FFT) processor is a FFT engine developed for the AT40K family of Field Programmable Gate Arrays (FPGAs). The design is based on a decimation-in-frequency radix-2 algorithm and employs in-place computation to optimize memory usage. In order to operate the processor, data must first be loaded into the internal RAM. The processor is then instructed to compute the FFT, overwriting the input data in the RAM with the results. Upon completion of the FFT, the results may be read out from the RAM via the output data port.
AT40K FPGA IP Core AT40K-FFT
Rev. 1132A-08/98
1
Operation and Interfacing
Figure 1. Internal architecture of the FFT processor.
Address Counter Data RAM
8
Address Counter
8
Data input port
12
12
12 12
Data output port
Butterfly
12
Address Counter
8
Twiddle Factor ROM
Control
Control port
Figure 1 shows the architecture of the FFT processor. The design is based on the radix-2 decimation in frequency algorithm and employs in-place computation to optimize memory usage. Figure 2. Example FFT processor configuration 1.
To operate the processor data must first be loaded into the internal RAM. The processor is then instructed to compute the FFT, overwriting the input data in the RAM with the results. On completion of the FFT the results can be read out from the RAM via the output data port.
Real input
ADC
12
FIFO
12
Address 8
ADC
FFT Processor
12
Proc
Data 24
Imag input
FIFO
12
Input Clock
Processor clock
Figure 3. Example FFT processor configuration 2.
Address 8
FFT Processor
Data 24
Proc
Processor clock
2
AT40K-FFT
AT40K-FFT
Interfacing requirements are likely to be highly application specific. The FFT processor has therefore been designed so that users may readily customize the data IO interface to meet their requirements. Two examples of likely interfacing scenarios are depicted in Figure 2 and Figure 3. In the first, the processor is used to perform transforms on real time data generated by a pair of ADC's. This data can either be buffered with a FIFO, or be fed directly into the processor, depending on whether "end to end" transforms are required or not. The results from the FFT processor are then read out to a microprocessor for further processing. In the second scenario both the input and output ports of the FFT processor are mapped into the memory space of a microprocessor. In this configuration the FFT processor can be used as a co-processor to the microprocessor. The FFT processor uses fractional 12-bit fixed point arithmetic. To prevent internal overflows occurring the processor scales all intermediate results by 1/2, thus for a 256 point transform the output data is scaled by 1/256. Furthermore, to ensure that no overflows occur, the input data must observe the following rules: |Re(Input)+jIm(Input)| < 1 or -1Re(Input)<1, Im(Input)=0 where Re(Input) and Im(Input) are the real and imaginary components of the input data. further information on the design the reader is referred to the design source files, especially the schematics. The reader is assumed to be familiar with the various FFT algorithms and their classification. General background information on the FFT algorithm can be found in [1], [2] and [3]. Information on hardware implementations of the FFT can be found in [4].
Butterfly
This function implements the radix-2 decimation in frequency butterfly described by the following equations: A'=(A+B)/2 B'=(A-B)xWT/2 where A and B are the complex inputs to the butterfly, A' and B' are the complex outputs and W T is the complex twiddle factor. The divide by two is not normally found in the butterfly equation, but is included here to prevent numeric overflow in the fixed point implementation. Analysis of this function indicates that it requires the following logical operations. * 2 complex data fetches * 2 complex data stores * 1 complex twiddle factor fetch * 1 complex addition (2 signed adders) * 1 complex subtraction (2 signed subtracters) * 1 complex multiply (4 signed multipliers and 3 signed adders) Unfortunately it is impossible to execute all these functions in a single cycle for the following two reasons. Firstly, the dual ported memory bank is incapable of supporting more than 1 data write and 1 data read per cycle, and secondly the implementation of 4 medium precision array multipliers would exceed the logic resources of a single AT40K FPGA. These problems can be overcome if the butterfly calculation is performed over two clock cycles. This balances the memory and butterfly IO requirements and also halves the number of multipliers required, permitting the design to fit on a single chip.
Detailed Design Description
The architecture of the design is depicted in Figure 1. From this it can be seen that the design contains the following components. * Butterfly * Twiddle factor ROM * Data RAM * Address generators * Controller The design of each of these components is discussed in turn, starting with the butterfly. Following this the design entry, layout and simulation strategies are described. For
3
Figure 4. Architecture of the butterfly.
Re(A), Re(B)
12 12
Delay
12
12
Re(A'), Re(B')
Im (A), Im(B )
12
12
Delay
12
12
Im(A'), Im(B')
13
13
14
12
Cross Bar
13 13 14 12
Im (W T ), -Re(W T )
Figure 4 shows the internal architecture of the 2 clock cycle butterfly. From the input the data flow splits into two main paths. The upper one computes A'=(A+B)/2 and the lower one B'=(A-B)xWT/2. The results from these two data paths are then multiplexed onto the output data bus using tristate buffers. In upper path the input data is fed into an accumulator. This sums it every two cycles, divides the result by two and rounds to 12 bits, giving A'=(A+B)/2. The result is then sent through a short register based delay to align it with the result from the slightly longer lower data path used to compute B'. In the B' data path the input data is fed into a differencing circuit which computes -A+B. The real and imaginary 13 bit results are then fed through a cross bar switch which presents the difference of the imaginary components (Im(A)+Im(B)) to both 13x12 bit signed multipliers in the first cycle, and the difference of the real components (Re(A)+Re(B)) in the second. In the first cycle both multipliers multiply by the imaginary component of the twiddle factor (Im(WT)) and in the second by the negated real component (-Re(WT)). (Note that the reason for negating the real component of the twiddle factor is discussed in section Twiddle factor ROM.) The outputs of the two multipliers are therefore Im(A+B)xIm(WT), Re(-A+B)x-Re(WT) and Re(-A+B)xIm(WT), Im(-A+B)x-Re(WT). The 25 bit results from the multipliers are then truncated to 14 bits. One data stream is fed into an accumulator which sums the results, divides by two and truncates the result to 12 bits with rounding, giving, Re(B')=(Im(-A+B)xIm(WT)+Re(-A+B)x Re(WT))/2=Re((A-B)xWT). The other is fed into a differencing circuit which subtracts the two results, divides by two and truncates the result to 12 bits with rounding, yielding,
Im(B')=(-Re(-A+B)xIm(WT)+Im(-A+B)xRe(WT))/2=Im((A-B)xWT). To improve performance the butterfly is more heavily pipelined than is shown in Figure 4. This results in an input to output latency for the butterfly of 9 cycles.
Twiddle factor ROM
The twiddle factor ROM is organized as 256 x 12 bit words. Each of the 128 complex twiddle factors is stored in the ROM with alternate words being used for the real and imaginary components. To improve the accuracy of the most common twiddle factor, i.e. W0=1, the real components of the twiddle factors are negated in the ROM. Thus W 0 is stored as -1 rather than 0.995; 0.995 being the closest approximation to 1 permitted by a 12 bit two's complement fractional integer. The twiddle factor ROM was generated using Atmel's macro generator. The input data file for the macro generator was generated using a MatLab script.
Data RAM
The data RAM is configured as two 256 x 12 bit dual ported memories. One is used to hold the real components and the other to hold the imaginary components. The availability of dual ported, as opposed to single ported, memory significantly improves the utilization of on chip RAM. Use of single ported RAM would have required twice as much memory to permit the concurrent data fetches and stores required by the butterfly. To achieve maximum memory efficiency the FFT algorithm uses "in place" computation, i.e. the data stored after each butterfly computation is placed back in the same locations that the input data was read from. This requirement has no impact on performance, but does somewhat complicate the design of the data address generators.
4
AT40K-FFT
AT40K-FFT
Address Generators
The design employs two address generators to control the input and output of data from the data RAM. One is used to generate the read addresses and the other the write addresses. Both produce the same address pattern, though they are skewed in phase by 9 cycles, i.e. the length of the butterfly pipeline. The data address generators are based on an 8 bit accumulator where the carry out signal is connected to the carry in signal. The accumulator increment is provide by a Johnson counter which increments every 256 cycles. The data address generators feature tristate outputs to disconnect them from the address busses while data is transferred into or out of the chip. The twiddle factor address generator produces a different address sequence. Here an 8 bit binary counter is employed, the output bits of which are ANDed with a masking function. This masking function is provided by an 8-bit shift register which steps through the following pattern 00000001b, 00000011b, 00000111b, etc., once every 256 cycles.
Design Capture and Layout
The design was entered via schematic capture using Workview Office. This design method was employed to provide good design visibility to users. Extensive use was made of user macros to ensure high performance layouts for sub components. The device targeted for the design was the AT40K30. Utilization of the part was as follows: Logic Cells: 1114/1600 = 69.6% RAM Cells: 48/100 = 48%
Design Simulation
The design was simulated using the gate level simulator ViewSim. Both functional and post layout simulations were performed to confirm the correct operation of the design. A MatLab script was employed to generated the input data files for the simulation. This same script was also used to read the results files from the simulation and check them against the MatLab FFT function.
Timing Analysis
Timing analysis of the design indicated a maximum clock speed of 21.2 MHz when using the AT40K30 part. Investigations into the limiting data path revealed this to be the delay from the twiddle factor address counter output, through the asynchronous twiddle factor ROM to the butterfly's twiddle factor input. Clearly, conversion of the twiddle ROM from asynchronous to synchronous operation would be a starting point for improving the design's performance.
Controller
The controller synchronizes the operation of all the other components. It contains a state machine closely coupled to a counter to generate all the control signals required. The state machine provides a simple two line control port (START and BUSY) to enable external devices to initiate processing and monitor its completion.
5
Design Analysis
Design Performance
The design requires Nlog2N clock cycles to compute the FFT, where N is the transform length. In this design N is fixed at 256, requiring a total of 2048 clock cycles. Timing analysis of the design (see section Timing Analysis) indicates a maximum clock rate of 21 MHz. This gives a total processing time for the FFT of 98 s. It should be noted that the design does not support concurrent data IO and processing. Consequently, in real applications the user must also allow time to transfer data in and out. 256 clock cycles are required to write data into the internal RAM, yielding a minimum data input time of 12 s at 21 MHz. Determination of the time required to read data out from the device is somewhat more difficult, partly because read accesses to the on chip RAM are asynchronous, and partly because the user may not require the complete output data set. However, it is not unrealistic to assume a similar data transfer time to the input. This gives a likely total data IO time of approximately 24 s at 21 MHz. System performance could potentially be improved by providing suitable buffering to permit concurrent data processing and IO. Double buffering designs would most probably have to use external memory devices. The only technique to radically improve performance is to consider other FFT architectures, probably involving multiple FPGAs. The most probable architecture is the "pipeline FFT" processor described in [4]. This requires log2N butterflies arranged in a line and interspersed with delay lines. Data is then fed continuously through the pipeline to compute the FFT. Using the current butterfly design such an FFT processor would require multiple FPGAs, each containing one or two butterflies and their associated line delays. For an transform length of N such an architecture would produce an log2N fold performance increase compared to the current design, (i.e. an 8 times improvement for a 256 point FFT.) An alternative strategy would be to attempt to reduce the complexity of the butterfly by using bit serial arithmetic, thus permitting more butterflies to be implemented on a single FPGA.
Suggested Techniques to Improve Performance
Three main methods exist to increase the rate at which AT40K FPGAs can perform FFTs, namely: * Increasing the design clock speed. * Double buffer data to hide data IO time. * Consider other FFT architectures. The simplest method of increasing performance is to increase the clock frequency of the design. Timing analysis (see section Timing Analysis) indicates that the limiting path is through the asynchronous ROM block. Conversion of the ROM block to synchronous operation would improve performance somewhat. In general the clock rate of the design is limited by two factors: the loading on the internal address and data busses and the speed of the multipliers. The first factor is affected by both the length of the transform and its precision. Decreasing these reduces the size of the dual ported memory, leading to reduced bus loadings and increasing the data IO bandwidth between the butterfly and memory. The second factor is affected only by the precision of the transform. Increasing the precision increases the size of the multipliers, reducing the system's performance. It should be noted that increasing either the data precision or the transform length will lead to a reduction in performance. For large transform lengths the use of an external dual ported memory should be considered, this may also provide faster data transfer times by reducing on chip bus loadings.
Recommendations to Improve Functionality
Two potential improvements to the design are suggested: * Use of block floating point. * Reduction of the size of the twiddle factor ROM. Currently the design uses fixed scaling by 1/2 at the output of the butterfly to prevent numeric overflow. However, this can significantly reduce the dynamic range of the output data when the input signals are weak. An alternative approach is to use a block floating point strategy [4]. In this case the scaling by 1/2 is only included in each FFT column of calculations if the results from the previous column are likely to cause overflow in the current column. This leads to an improvement in the dynamic range of the output data. The additional logic required in the butterfly to implement a block floating point is a conditional divide by 2, i.e. a simple shifter. Inclusion of this function in the butterfly should neither significantly increase its size or degrade its performance. In addition to changes to the butterfly some extra logic is required in the controller unit to operate the shifter. The overall size of the design could be reduced by investigating techniques to decrease the size of the twiddle factor ROM. Currently this stores half of the unit circle. By including logic to manipulate the input address and sign of the output data it should be possible to reduce the size of the ROM to store only a quarter or eighth of the unit circle. This is, however, only likely to be of significant benefit for large FFTs.
6
AT40K-FFT
AT40K-FFT
7
Atmel Headquarters
Corporate Headquarters
2325 Orchard Parkway San Jose, CA 95131 TEL (408) 441-0311 FAX (408) 487-2600
Atmel Operations
Atmel Colorado Springs
1150 E. Cheyenne Mtn. Blvd. Colorado Springs, CO 80906 TEL (719) 576-3300 FAX (719) 540-1759
Europe
Atmel U.K., Ltd. Coliseum Business Centre Riverside Way Camberley, Surrey GU15 3YL England TEL (44) 1276-686677 FAX (44) 1276-686697
Atmel Rousset
Zone Industrielle 13106 Rousset Cedex, France TEL (33) 4 42 53 60 00 FAX (33) 4 42 53 60 01
Asia
Atmel Asia, Ltd. Room 1219 Chinachem Golden Plaza 77 Mody Road Tsimshatsui East Kowloon, Hong Kong TEL (852) 27219778 FAX (852) 27221369
Japan
Atmel Japan K.K. Tonetsu Shinkawa Bldg., 9F 1-24-8 Shinkawa Chuo-ku, Tokyo 104-0033 Japan TEL (81) 3-3523-3551 FAX (81) 3-3523-7581
Fax-on-Demand
North America: 1-(800) 292-8635 International: 1-(408) 441-0732
e-mail
literature@atmel.com
Web Site
http://www.atmel.com
BBS
1-(408) 436-4309
(c) Atmel Corporation 1998. Atmel Cor poration makes no warranty for the use of its products, other than those expressly contained in the Company's standard warranty which is detailed in Atmel's Terms and Conditions located on the Company's website. The Company assumes no responsibility for any errors which may appear in this document, reserves the right to change devices or specifications detailed herein at any time without notice, and does not make any commitment to update the information contained herein. No licenses to patents or other intellectual proper ty of Atmel are granted by the Company in connection with the sale of Atmel products, expressly or by implication. Atmel's products are not authorized for use as critical components in life suppor t devices or systems. Marks bearing
(R)
and/or
TM
are registered trademarks and trademarks of Atmel Corporation. Printed on recycled paper.
1132A-08/98/15M
Terms and product names in this document may be trademarks of others.

▲Up To Search▲

Price & Availability of AT40K-FFT

	To Download AT40K-FFT Datasheet File
If you can't view the Datasheet, Please click here to try to view without PDF Reader .